PERF faster head, tail and size groupby methods #5533

hayd · 2013-11-17T04:47:17Z

Try again with #5518.

Massive gains in groupby head and tail, plus more tests for these. There is a slight speed improvement in size (not as much as I'd hoped; basically, iterating through grouper.indices is slow :( ).

As mentioned before, I added a helper function to prepend as_index to index (if that makes any sense). I think it could be faster, and using it in apply may also fix some bugs there...
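
For reference, a minimal sketch of what size computes, assuming a current pandas install and a made-up toy frame (the column names `key` and `val` are illustrative only, not from this PR). Counting the entries of `grouped.indices` by hand mirrors the slow iteration mentioned above; it is not the code in this PR.

```python
import pandas as pd

# Toy frame; "key" and "val" are hypothetical column names for illustration.
df = pd.DataFrame({"key": ["b", "a", "b", "a", "b"], "val": [1, 2, 3, 4, 5]})
grouped = df.groupby("key")

# Manual route: walk the dict of group label -> row positions and count each entry.
sizes_from_indices = {
    label: len(positions) for label, positions in grouped.indices.items()
}

# Built-in route: a single call returning a Series indexed by group label.
sizes_builtin = grouped.size()

print(sizes_from_indices)
print(sizes_builtin.to_dict())
```

Both routes should agree on the group counts; the difference is only in how much Python-level looping happens.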

hayd · 2013-11-18T00:15:17Z

@jreback I fixed this up. A slight API change is that head now respects the original frame order (it doesn't if you do an apply).
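
A small sketch of the ordering difference, assuming a current pandas install and a made-up toy frame (the column names are illustrative). It selects a single column for the apply route to keep the comparison simple.

```python
import pandas as pd

# Toy frame with interleaved groups; names are hypothetical.
df = pd.DataFrame({"key": ["b", "a", "b", "a", "b"], "val": [1, 2, 3, 4, 5]})

# groupby head: rows come back in the original frame order, original index kept.
fast = df.groupby("key").head(2)

# apply route: results are concatenated group by group (sorted by key),
# so the rows for "a" come before the rows for "b".
slow = df.groupby("key", group_keys=True)["val"].apply(lambda s: s.head(2))

print(fast)
print(slow)
```

The two results hold the same rows; only the order and the index differ.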

jreback · 2013-11-18T13:51:01Z

Can you add some tests where the groups vary in size and where the number you are asking for (e.g. head(3)) is greater than the number of rows in some/all groups?

hayd · 2013-11-18T21:14:07Z

Added tests for n <= 0 and n > max group size. There was already one for n between the group sizes.

(though it's a small example, I think it covers...)
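
A rough sketch of these edge cases, assuming a current pandas install and a hypothetical toy frame with groups of different sizes. Negative n is left out on purpose, since its behaviour has differed between pandas versions.

```python
import pandas as pd

# Groups of different sizes: "a" has 2 rows, "b" has 3. Names are illustrative.
df = pd.DataFrame({"key": ["b", "a", "b", "a", "b"], "val": [1, 2, 3, 4, 5]})
grouped = df.groupby("key")

# n == 0: nothing is selected from any group.
empty = grouped.head(0)

# n between the group sizes: "a" contributes all 2 rows, "b" only its first 2.
between = grouped.head(2)

# n larger than every group: every row is returned, in original order.
everything_head = grouped.head(10)
everything_tail = grouped.tail(10)

print(len(empty), len(between), len(everything_head), len(everything_tail))
```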

jorisvandenbossche · 2013-11-19T08:55:40Z

pandas/core/groupby.py

+        '''
+        Returns first n rows of each group.
+
+        Essentially equivalent to .apply(lambda x: x.head(n))


Can you double backquote the code (.apply(..)) so it is rendered as code?

jorisvandenbossche · 2013-11-19T09:11:28Z

Added some comments to the docstrings.

I have another question regarding the docstring style:

You changed the `"""` to `'''`. Is this recommended, or is there some 'policy' on this in pandas? Because I always use `"""` (and this is also used in PEP8: http://www.python.org/dev/peps/pep-0008/#documentation-strings).
It doesn't matter at all to me what we use, but maybe we should try to be consistent, and I had the impression that `"""` is used the most in the pandas source code.

hayd · 2013-11-19T18:38:19Z

@jorisvandenbossche Thanks for the comments, will update.

hayd · 2013-11-19T21:50:11Z

Fixed these. Mentioned the ascending arg to cumcount, which I've purposely made kwarg only for now.
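
For illustration, a sketch of the ascending argument to cumcount and how it relates to head and tail, assuming a current pandas install and a made-up toy frame. The mask-based selection here is one way to think about it, not necessarily how this PR implements head and tail.

```python
import pandas as pd

# Toy frame; column names are hypothetical.
df = pd.DataFrame({"key": ["b", "a", "b", "a", "b"], "val": [1, 2, 3, 4, 5]})
grouped = df.groupby("key")

# Number each row within its group, counting from the start of the group.
forward = grouped.cumcount()

# Same numbering, but counting from the end of the group (keyword only).
backward = grouped.cumcount(ascending=False)

# Keeping rows whose within-group position is below n gives the same rows
# as head(n); doing the same with the reversed count matches tail(n).
head_via_mask = df[forward < 2]
tail_via_mask = df[backward < 2]

print(forward.tolist())
print(backward.tolist())
```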

Noticed a weird related thing with nth on a DataFrame (it just doesn't work, and is kind of undefined); will make a separate issue though: #5552

hayd · 2013-11-20T01:26:39Z

pandas/core/groupby.py

@@ -474,6 +473,10 @@ def ohlc(self):
        return self._cython_agg_general('ohlc')

    def nth(self, n):
+        """
+        Return the nth row of each group


The thing I noticed makes this a complete lie. Oops, will delete then merge.

PERF faster head, tail and size groupby methods

e8e7735

jorisvandenbossche reviewed Nov 19, 2013
TST more coverage for groupby head and tail

ef38319

hayd reviewed Nov 20, 2013
hayd added a commit that referenced this pull request Nov 20, 2013

Merge pull request #5533 from hayd/groupby_head_tail

e5e53ba

hayd merged commit e5e53ba into pandas-dev:master Nov 20, 2013

hayd deleted the groupby_head_tail branch November 20, 2013 01:42
